simon-5502-14-slides

What did we learn in biostats-2, module 02?

Three models

  • All use a continuous dependent (outcome) variable
  • All include multiple independent variables
  • Multiple linear regression (Week 02)
    • All independent variables are continuous
  • Analysis of covariance (Week 03)
    • Mix of continuous and categorical independent variables
  • Multi-factor analysis of variance (Week 04)
    • All independent variables are categorical

Why three models?

  • Historical precedents
  • Different issues
    • Multicollinearity
    • Mediator variable
    • Risk adjustment
    • Moderator variable
    • Interactions

The general linear model

  • Single model that unites all three models.
  • Use of indicator variables for categorical data
  • Not the same as the generalIZED linear model
    • SAS: proc glm versus proc genmod
    • R: lm() versus glm()

Arguments for the lm() function

  • formula = \(dependent-variable\) ~ \(independent-variables\)
    • \(independent-variables\) can be numeric, factors, or strings
  • data =
  • subset =
  • na.action =
    • na.fail
    • na.omit
    • na.exclude
  • other arguments
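As a minimal sketch of how these arguments fit together (the data frame `bp` and its variables are hypothetical, invented for illustration):

```r
# Hypothetical data: continuous outcome, one numeric and one factor predictor
bp <- data.frame(
  sbp   = c(120, 135, 128, 142, 118, 131),
  age   = c(34, 58, 45, 61, 29, 50),
  group = factor(c("control", "drug", "drug", "control", "control", "drug"))
)

# formula, data, subset, and na.action arguments of lm()
fit <- lm(sbp ~ age + group,      # outcome ~ independent variables
          data      = bp,         # where to find the variables
          subset    = age >= 30,  # optional: restrict the rows used
          na.action = na.omit)    # drop rows with missing values
coef(fit)
```

Note that na.omit drops incomplete rows before fitting, while na.exclude fits the same model but pads residuals and fitted values with NA so they line up with the original data.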

What did we learn in biostats-2, module 03?

What is a covariate?

  • Variable not of direct interest
    • Relationship to outcome is already established
    • Still must be accounted for
  • Examples
    • Smoking in a cancer study
    • Gestational age in a neonatology study
  • A covariate can be continuous or categorical

What is covariate imbalance?

  • Difference in mean value between treatment and control group
    • Often a problem in observational studies
    • Sometimes a problem in randomized studies

Why is covariate imbalance an issue?

  • Biased estimates
    • Comparing apples to oranges
  • Harms study credibility

Examples of covariate imbalance

  • Age in a study of smoking and Down’s syndrome
  • Smoking in a study of artillery assignment and sperm count

Covariate imbalance versus confounding

  • Covariate imbalance is simpler
  • Confounder definition relies on causation arguments

Preventing covariate imbalance

  • Randomization
  • Matching
  • Stratification

Adjusting for covariate imbalance

  • Propensity score models
  • Analysis of covariance

Variables cannot be in the causal pathway

  • Fixed at time of randomization
  • Temporally preceding exposure
  • Example: bottles given during a breast feeding study

Adjusting for baseline measurements

  • Baseline = measurements prior to intervention
    • Done to improve precision
    • Can use baseline as a covariate
  • Change score is an alternative
    • Also known as difference in differences (DID) model
    • Possible regression to the mean

What did we learn in module 04?

Mathematical model, 1

  • Decompose \(\mu_{ij}\) into \(\mu + \alpha_i + \beta_j\)
    • \(\alpha_i\) is the deviation for the ith level of first factor
    • \(\beta_j\) is the deviation for the jth level of second factor
    • Require \(\alpha_1=0\) and \(\beta_1=0\)
    • \(\mu\) is the mean for the reference levels

Mathematical model, 2

  • \(Y_{ijk} = \mu + \alpha_i + \beta_j +\epsilon_{ijk}\)
    • i=1,…,a levels of the first categorical variable
    • j=1,…,b levels of the second categorical variable
    • k=1,…,n replicates within each combination of the first and second categories
  • Note: \(\mu, \alpha_i, \beta_j, \epsilon_{ijk}\) are population values

Mathematical model, 3

  • \(H_0:\ \alpha_i=0\) for all i
  • \(H_0:\ \beta_j=0\) for all j

Parameter estimates for the two factor model

# A tibble: 14 × 5
   term        estimate std.error statistic p.value  
   <chr>          <dbl>     <dbl>     <dbl> <glue>   
 1 (Intercept)    4.72      1.50      3.14  p = 0.005
 2 MoonDuring     2.5       0.984     2.54  p = 0.019
 3 MoonAfter      0.542     0.984     0.550 p = 0.588
 4 MonthSep       4.03      1.97      2.05  p = 0.053
 5 MonthOct       3.73      1.97      1.90  p = 0.071
 6 MonthNov       3.70      1.97      1.88  p = 0.073
 7 MonthDec       2.63      1.97      1.34  p = 0.195
 8 MonthJan       5.03      1.97      2.56  p = 0.018
 9 MonthFeb       6.93      1.97      3.52  p = 0.002
10 MonthMar       8.57      1.97      4.35  p < 0.001
11 MonthApr      13.0       1.97      6.61  p < 0.001
12 MonthMay       8.60      1.97      4.37  p < 0.001
13 MonthJun       7.10      1.97      3.61  p = 0.002
14 MonthJul      11.0       1.97      5.61  p < 0.001

Analysis of variance table comparing the two factor model to the null model

# A tibble: 2 × 7
  term                     df.residual   rss    df sumsq statistic p.value  
  <chr>                          <dbl> <dbl> <dbl> <dbl>     <dbl> <glue>   
1 Admission ~ 1                     35  625.    NA   NA      NA    <NA>     
2 Admission ~ Moon + Month          22  128.    13  497.      6.58 p < 0.001

Analysis of variance table comparing the two factor model to the one factor model

# A tibble: 2 × 7
  term                     df.residual   rss    df sumsq statistic p.value  
  <chr>                          <dbl> <dbl> <dbl> <dbl>     <dbl> <glue>   
1 Admission ~ Moon                  33  583.    NA   NA      NA    <NA>     
2 Admission ~ Moon + Month          22  128.    11  456.      7.13 p < 0.001

R-squared values

# A tibble: 3 × 3
  model  r.squared deviance
  <glue>     <dbl>    <dbl>
1 m1        0          625.
2 m2        0.0664     583.
3 m3        0.795      128.

Tukey post hoc test

# A tibble: 3 × 7
  term  contrast      null.value estimate conf.low conf.high adj.p.value
  <chr> <chr>              <dbl>    <dbl>    <dbl>     <dbl> <chr>      
1 Moon  After-Before           0    0.542  -1.93        3.01 0.847      
2 Moon  During-Before          0    2.50    0.0280      4.97 0.047      
3 Moon  During-After           0    1.96   -0.514       4.43 0.138      

What did we learn in module 05?

Mathematical model, 1

  • \(Y_{ijk}=\mu+\alpha_i+\beta_j+(\alpha \beta)_{ij}+\epsilon_{ijk}\)
    • i=1,…,a, j=1,…,b, k=1,…,n
  • If 1 is the reference category
    • \(\alpha_1=0\)
    • \(\beta_1=0\)
    • \((\alpha \beta)_{1j}=0\)
    • \((\alpha \beta)_{i1}=0\)

Mathematical model, 2

  • \(SS_A=\Sigma_i nb(\bar{Y}_{i..}-\bar{Y}_{...})^2\)
  • \(SS_B=\Sigma_j na(\bar{Y}_{.j.}-\bar{Y}_{...})^2\)
  • \(SS_{AB}=\Sigma_i \Sigma_j n(\bar{Y}_{ij.}-\bar{Y}_{i..}-\bar{Y}_{.j.}+ \bar{Y}_{...})^2\)
  • \(SS_E=\Sigma_i \Sigma_j \Sigma_k (Y_{ijk}-\bar{Y}_{ij.})^2\)
  • \(SS_T=\Sigma_i \Sigma_j \Sigma_k (Y_{ijk}-\bar{Y}_{...})^2\)

Test for an interaction

  • \(SS_{AB}\) has (a-1)(b-1) degrees of freedom
  • \(SS_E\) has ab(n-1) degrees of freedom
  • Accept \(H_0\) if \(F=\frac{MS_{AB}}{MS_E}\) is close to one
    • In R, fit a model without an interaction
    • Compare to a model with interaction
    • Using the anova function
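A sketch of this comparison in R, using the built-in warpbreaks data (two categorical predictors: wool with a = 2 levels and tension with b = 3 levels, so the interaction has (a-1)(b-1) = 2 degrees of freedom):

```r
# Model without the interaction versus model with the interaction
additive  <- lm(breaks ~ wool + tension, data = warpbreaks)
with_int  <- lm(breaks ~ wool * tension, data = warpbreaks)

# F test for the interaction term: MS_AB / MS_E
anova(additive, with_int)
```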

What did we learn in module 06?

Comparing two binary outcomes

  • Is there a difference in the proportion of deaths between male passengers and female passengers on the Titanic?
  • Is there a difference in the proportion of patients finishing the full three doses of HPV vaccine between Black women and White women?
  • Does using an NG tube for feeding in pre-term infants increase the probability of successful breast feeding at six months?

Other comparisons involving a binary outcome

  • Is there a difference in the proportion of deaths between first class, second class, and third class passengers?
  • Does age influence the proportion of women finishing the full three doses of HPV vaccine?
  • Controlling for the mother’s age, does using an NG tube for feeding in pre-term infants increase the probability of successful breast feeding at six months?

Hypothesis framework

  • \(H_0:\ \pi_1=\pi_2\)
  • \(H_1:\ \pi_1 \ne \pi_2\)
  • Compute \(\hat p_1\) and \(\hat p_2\) from samples
  • Accept \(H_0\) if \(\hat p_1-\hat p_2\) is close to zero.
    • \(T=(\hat p_1-\hat p_2)/s.e.\)
    • 95% CI: \((\hat p_1-\hat p_2) \pm Z_{\alpha/2}s.e.\)
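As a sketch, this test in R with the Titanic counts used later in these slides (correct = FALSE turns off the continuity correction, matching the chi-square statistic of about 332 shown below):

```r
# Two-sample test of proportions:
# 308 of 462 female passengers and 142 of 851 male passengers survived
survived <- c(308, 142)
total    <- c(462, 851)

result <- prop.test(survived, total, correct = FALSE)
result$estimate   # sample proportions for each group
result$conf.int   # 95% CI for the difference p1 - p2
result$p.value
```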

The Titanic dataset

Rows: 1,313
Columns: 5
$ Name     <chr> "Allen, Miss Elisabeth Walton", "Allison, Miss Helen Loraine"…
$ PClass   <chr> "1st", "1st", "1st", "1st", "1st", "1st", "1st", "1st", "1st"…
$ Age      <dbl> 29.00, 2.00, 30.00, 25.00, 0.92, 47.00, 63.00, 39.00, 58.00, …
$ Sex      <chr> "female", "female", "male", "female", "male", "male", "female…
$ Survived <dbl> 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1…

Counts and percentages

        Survived
Sex      Yes  No
  female 308 154
  male   142 709
        Survived
Sex            Yes        No
  female 0.6666667 0.3333333
  male   0.1668625 0.8331375

Test for difference in proportions

# A tibble: 1 × 9
  estimate1 estimate2 statistic  p.value parameter conf.low conf.high method    
      <dbl>     <dbl>     <dbl>    <dbl>     <dbl>    <dbl>     <dbl> <chr>     
1     0.667     0.167      332. 3.43e-74         1    0.450     0.550 2-sample …
# ℹ 1 more variable: alternative <chr>

Chi-square test of independence, 1 of 2

  • Equivalent to test of two proportions
  • Lay out data in two by two table
\[\begin{matrix} & No\ event & Event \\ Treatment & O_{11} & O_{12}\\ Control & O_{21} & O_{22} \end{matrix}\]

Chi-square test of independence, 2 of 2

\[\begin{matrix} & No\ event & Event \\ Treatment & E_{11} = n_1 (1-\hat p_.) & E_{12}=n_1 \hat p_.\\ Control & E_{21} = n_2 (1-\hat p_.) & E_{22}=n_2 \hat p_. \end{matrix}\]
  • \(X^2=\Sigma \frac{(O_{ij}-E_{ij})^2}{E_{ij}}\)
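A sketch of this computation in R with the Titanic counts, building the expected counts from the row and column totals:

```r
# Observed counts from the Titanic table
O <- matrix(c(308, 154,
              142, 709), nrow = 2, byrow = TRUE,
            dimnames = list(Sex = c("female", "male"),
                            Survived = c("Yes", "No")))

# Expected counts under independence: E_ij = (row total)(column total) / n
E <- outer(rowSums(O), colSums(O)) / sum(O)

# Pearson chi-square statistic: sum of (O - E)^2 / E
X2 <- sum((O - E)^2 / E)
X2

# Same statistic from chisq.test without the continuity correction
chisq.test(O, correct = FALSE)$statistic
```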

Expected counts for Titanic

Observed counts

        Survived
Sex      Yes  No
  female 308 154
  male   142 709

Expected counts

        Survived
Sex           Yes       No
  female 158.3397 303.6603
  male   291.6603 559.3397

Chi-square test for Titanic

# A tibble: 1 × 4
  statistic  p.value parameter method                    
      <dbl>    <dbl>     <int> <chr>                     
1      332. 3.43e-74         1 Pearson's Chi-squared test

Odds ratio calculation

       No event  Event  Odds
Group1    a         b
Group2    c         d
  • Odds for group 1 = \(b/a\)
  • Odds for group 2 = \(d/c\)
  • Odds ratio = \(\frac{d/c}{b/a} = \frac{ad}{bc}\)
  • s.e.(log OR) = \(\sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}\)

Titanic data

       Survived   Died  Total
Female   308      154     462
Male     142      709     851
Total    450      863   1,313

Titanic data, odds of death

       Survived   Died  Total  Odds
Female   308      154     462  2     to 1 against
Male     142      709     851  4.993 to 1 in favor
Total    450      863   1,313

Odds ratio = 4.993 / 0.5 = 9.986
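A sketch of this calculation in R (a Wald interval on the log odds scale; the median-unbiased estimate reported by epitools on the next slide differs slightly):

```r
# Titanic counts: a = female survived, b = female died,
#                 c = male survived,   d = male died
a <- 308; b <- 154; c_ <- 142; d <- 709

# Odds ratio for death, male versus female: ad/bc
or <- (a * d) / (b * c_)
or                                  # about 9.986

# 95% CI built on the log odds ratio scale
se <- sqrt(1/a + 1/b + 1/c_ + 1/d)
ci <- exp(log(or) + c(-1, 1) * qnorm(0.975) * se)
ci
```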

Odds ratio for survival by sex

$data
        Survived
Sex      Yes  No Total
  female 308 154   462
  male   142 709   851
  Total  450 863  1313

$measure
        odds ratio with 95% C.I.
Sex      estimate    lower    upper
  female 1.000000       NA       NA
  male   9.956188 7.662525 13.00928

$p.value
        two-sided
Sex      midp.exact fisher.exact   chi.square
  female         NA           NA           NA
  male            0 4.826448e-74 3.425855e-74

$correction
[1] FALSE

attr(,"method")
[1] "median-unbiased estimate & mid-p exact CI"

What did we learn in module 07?

What is a diagnostic test?

  • Indication of disease
    • Rapid, convenient, and/or inexpensive
  • Gold standard
    • Indication of same disease
    • Slow, inconvenient, and/or expensive

Examples of diagnostic tests, 1 of 2

  • Yale-Brown obsessive-compulsive scale
    • Do you often feel sad or depressed?
  • SCOFF questionnaire
    • Five yes/no questions
    • Two or more yes responses

More examples

  • Rectal bleeding as a sign of colorectal cancer
  • Electrocardiogram, QTc dispersion

What did we learn in module 08?

Survival analysis

  • Time to event models
    • Death
    • Relapse
    • Rehospitalization
    • Failure of medical device
    • Pregnancy
  • Not every patient experiences the event
    • These are censored observations

First fruit fly experiment, 1

data_dictionary: fly1.txt
description: |
  This dataset provides a simple example of survival and censoring, with an intuitive illustration of how survival probabilities are estimated.
vars:
  day:
    label: Time until death
    unit: days

First fruit fly experiment, 2

37, 40, 43, 44, 45, 47, 49, 54, 56, 58, 59, 60, 61, 62, 68, 70, 71, 72, 73, 75, 77, 79, 89, 94, 96

First fruit fly experiment, 3

  day   p
1  37 96%
2  40 92%
3  43 88%
4  44 84%
5  45 80%
6  47 76%
7  49 72%
8  54 68%
9  56 64%
   day   p
10  58 60%
11  59 56%
12  60 52%
13  61 48%
14  62 44%
15  68 40%
16  70 36%
17  71 32%
18  72 28%
   day   p
19  73 24%
20  75 20%
21  77 16%
22  79 12%
23  89  8%
24  94  4%
25  96  0%

First fruit fly experiment, 4

Second fruit fly experiment, 1

37, 40, 43, 44, 45, 47, 49, 54, 56, 58, 59, 60, 61, 62, 68, ??, ??, ??, ??, ??, ??, ??, ??, ??, ??

Second fruit fly experiment, 2

  day event
1  37     1
2  40     1
3  43     1
4  44     1
5  45     1
6  47     1
7  49     1
8  54     1
9  56     1
   day event
10  58     1
11  59     1
12  60     1
13  61     1
14  62     1
15  68     1
16  70     0
17  70     0
18  70     0
   day event
19  70     0
20  70     0
21  70     0
22  70     0
23  70     0
24  70     0
25  70     0

Second fruit fly experiment, 3

  day event   p
1  37     1 96%
2  40     1 92%
3  43     1 88%
4  44     1 84%
5  45     1 80%
6  47     1 76%
7  49     1 72%
8  54     1 68%
9  56     1 64%
   day event   p
10  58     1 60%
11  59     1 56%
12  60     1 52%
13  61     1 48%
14  62     1 44%
15  68     1 40%
16  70     0    
17  70     0    
18  70     0    
   day event p
19  70     0  
20  70     0  
21  70     0  
22  70     0  
23  70     0  
24  70     0  
25  70     0  

Second fruit fly experiment, 4

Third fruit fly experiment, 1

37, 40, 43, 44, 45, 47, 49, 54, 56, 58, 59, 60, 61, 62, 68, ??, 71, ??, ??, 75, ??, ??, 89, ??, 96

Third fruit fly experiment, 2

  day event
1  37     1
2  40     1
3  43     1
4  44     1
5  45     1
6  47     1
7  49     1
8  54     1
9  56     1
   day event
10  58     1
11  59     1
12  60     1
13  61     1
14  62     1
15  68     1
16  70     0
17  71     1
18  70     0
   day event
19  70     0
20  75     1
21  70     0
22  70     0
23  89     1
24  70     0
25  96     1

Third fruit fly experiment, 3

  day event   p
1  37     1 96%
2  40     1 92%
3  43     1 88%
4  44     1 84%
5  45     1 80%
6  47     1 76%
7  49     1 72%
8  54     1 68%
9  56     1 64%
   day event   p
10  58     1 60%
11  59     1 56%
12  60     1 52%
13  61     1 48%
14  62     1 44%
15  68     1 40%
16  70     0    
17  71     1 30%
18  70     0    
   day event   p
19  70     0    
20  75     1 20%
21  70     0    
22  70     0    
23  89     1 10%
24  70     0    
25  96     1  0%
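A sketch of the same estimate using the survival package (a recommended package that ships with R); the Kaplan-Meier product-limit estimate reproduces the p column for the third experiment:

```r
library(survival)

# Third fruit fly experiment: 15 deaths, 6 flies censored at day 70,
# and 4 later deaths at days 71, 75, 89, 96
day   <- c(37, 40, 43, 44, 45, 47, 49, 54, 56, 58, 59, 60, 61, 62, 68,
           70, 71, 70, 70, 75, 70, 70, 89, 70, 96)
event <- c(rep(1, 15), 0, 1, 0, 0, 1, 0, 0, 1, 0, 1)

# Kaplan-Meier estimate of the survival curve
fit <- survfit(Surv(day, event) ~ 1)
summary(fit)
```

The censored flies leave the risk set at day 70 without producing a step in the curve, so the estimate drops to 30% at day 71 and 20% at day 75, as in the table.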

Third fruit fly experiment, 4

Interpreting Kaplan-Meier plots, 1

Interpreting Kaplan-Meier plots, 2

Interpreting Kaplan-Meier plots, 3

Interpreting Kaplan-Meier plots, 4

What did we learn in module 09?

Meta-analysis

  • Quantitative pooling of results from multiple studies
    • Multi-center study
      • Each center has a different protocol
      • Some centers do not share results
  • Contrast to systematic overview
    • Careful review of multiple studies
    • May or may not include quantitative pooling
  • Contrast to scoping review
    • “Researchers may conduct scoping reviews instead of systematic reviews where the purpose of the review is to identify knowledge gaps, scope a body of literature, clarify concepts or to investigate research conduct.” Munn 2018

Case study: Declining sperm counts

  • Meta-analysis published in 1992 in BMJ
    • 62 studies
    • 1938 through 1991
    • 12 million/ml decline per decade
  • Criticism
    • Early studies in North America, later studies more global
    • Variations in collection methods, abstinence requirements
    • Many competing analyses
  • Bad news/good news

Major issues in meta-analysis

  • Heterogeneity
    • Were apples combined with oranges?
  • Publication bias
    • Were some apples left on the tree?
  • Study quality
    • Were all the apples rotten?
  • Interpretability
    • Did the pile of apples amount to more than just a hill of beans?

What did we learn in module 10?

Talk given to first year medical students

  • Topic also relevant to this class.
  • Only a few minor changes
    • Different format for the “programming” assignment

Who am I?

Steve Simon

  • PhD Statistics, 1982, U Iowa
  • Teach in Biomedical and Health Informatics
    • Previous jobs at CMH, CDC
  • Part-time independent statistical consultant (P.Mean Consulting)
  • Married to a Pediatric Cardiologist (retired)
  • Run 5K and 4 mile races

Obsessed with computers since 1972

Figure 1. Section on computer skills from my resume

Worked with health care applications since 1987

  • Recent positions
    • Centers for Disease Control and Prevention (1987-1996)
    • Children’s Mercy Hospital (1996-2008)
    • UMKC School of Medicine (2008 to present)
  • But…
    • I am not a doctor
    • Still confused about many things
      • Example: Difference between good and bad cholesterol.

Quiz questions (1/3)

Why does Joel Best call statistics a social construct?

  • Statistics are misquoted often on social media.
  • Statistics are selected, shaped, and presented by human beings.
  • Statistics are used to promote socialism.
  • Statistics are dehumanizing.

Quiz questions (2/3)

What is the main philosophical foundation of empiricism?

  • Everything can be reduced to a mathematical equation.
  • Experiments can reveal the realities of the world.
  • Some questions are impossible to answer.
  • We construct our own reality based on our own lived experiences.

Quiz questions (3/3)

What is a major problem with data science?

  • Data scientists rely on large amounts of data with uneven quality.
  • Models developed by data scientists can lead to loss of privacy.
  • Prediction models are a black box that can hide discriminatory intent.
  • All of the above.

First poll question

Figure 2. Quote from “Peggy Sue Got Married”

Second poll question

Figure 3. Images of various computers

Are Statisticians Gods?

I’m helping someone who wants an alternative statistical analysis to the one used by the principal investigator. I’m happy to help and will offer advice about why my approach may be better, but I was warned that the PI considers the analysis chosen to be ordained by the “Statistical Gods” at her place of work.

What did we learn in module 11?

Hierarchical data

  • Moving beyond the independence assumption
  • Correlation within clusters

Examples of hierarchical data, 1 of 2

  • Body parts
    • Left eye/right eye
    • Teeth
    • Skin patches
  • Human families
  • Animal litters

Examples of hierarchical data, 2 of 2

  • Clinics/hospitals
  • Communities
  • Repeated measurements

Longitudinal data (topic for next module)

  • Measurements taken at different times
    • Emphasis on changes over time

Between and within cluster comparisons

  • Positive correlation
    • Improves precision of within cluster comparisons
    • Hurts precision of between cluster comparisons
  • Example with litters
    • Medication administered during pregnancy
    • Medication administered after birth

Basic notation, 1 of 2

  • \(Y_{ij}\)
    • i defines cluster
      • i=1,…,a
    • j defines individual within cluster
      • j=1,…,n

Basic notation, 2 of 2

  • \(Y_{ij} = \mu+\alpha_i+\epsilon_{ij}\)
    • \(\mu\) unknown constant
    • \(\alpha_i\) is normally distributed
      • \(SD(\alpha_i)=\sigma_{between}\)
    • \(\epsilon_{ij}\) is normally distributed
      • \(SD(\epsilon_{ij})=\sigma_{within}\)

Some basic results

  • \(SD(Y_{ij})=\sigma_{total} = \sqrt{\sigma^2_{between}+\sigma^2_{within}}\)
  • \(SD(\bar{Y}_{..})=\sqrt{\frac{\sigma^2_{between}}{a}+\frac{\sigma^2_{within}}{an}}\)
  • \(Corr(Y_{ij}, Y_{ik})=\frac{\sigma^2_{between}}{\sigma^2_{between}+\sigma^2_{within}}\)
    • Intraclass correlation (ICC)

Expected mean squares, 1 of 2

  • \(MS(between) = \frac{1}{a-1}\Sigma n(\bar{Y}_{i.}-\bar{Y}_{..})^2\)
    • \(E[MS(between)] = n\sigma^2_{between}+\sigma^2_{within}\)

Expected mean squares, 2 of 2

  • \(MS(within) = \frac{1}{a(n-1)}\Sigma\Sigma(Y_{ij}-\bar{Y}_{i.})^2\)
    • \(E[MS(within)] = \sigma^2_{within}\)

Variance components estimates

  • \(\hat\sigma_{between}^2=\frac{MS(between)-MS(within)}{n}\)
  • \(\hat\sigma_{within}^2=MS(within)\)
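A sketch of these method-of-moments estimates in base R, using simulated clustered data where the true values are known (\(\sigma_{between}=2\), \(\sigma_{within}=1\); the simulated values are invented for illustration):

```r
set.seed(1)
a <- 10; n <- 5                        # 10 clusters, 5 observations each
cluster <- factor(rep(1:a, each = n))
alpha   <- rnorm(a, sd = 2)            # cluster effects, sigma_between = 2
y <- 10 + rep(alpha, each = n) + rnorm(a * n, sd = 1)  # sigma_within = 1

# One-way ANOVA mean squares
ms <- summary(aov(y ~ cluster))[[1]][["Mean Sq"]]
ms_between <- ms[1]; ms_within <- ms[2]

# Variance component estimates from the expected mean squares
sigma2_between <- (ms_between - ms_within) / n
sigma2_within  <- ms_within
icc <- sigma2_between / (sigma2_between + sigma2_within)
c(between = sigma2_between, within = sigma2_within, icc = icc)
```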

What did we learn in module 12?

Longitudinal data

  • Measurements taken at different times
    • Emphasis on changes over time

Random intercepts model, 1

  • Simplest pattern for longitudinal data
  • \(Y_{ij},\ i=1,...,n;\ j=1,...,k\)
    • n subjects, k time points
  • \(t_j\), time of jth measurement
    • First time is often zero

Random intercepts model, 2

  • \(Y_{ij}=\beta_0+u_{0i}+\beta_1 t_j + \epsilon_{ij}\)
    • \(\beta_0\) and \(\beta_1\) are unknown constants
    • \(u_{0i}\) and \(\epsilon_{ij}\) are normally distributed
      • \(SD(u_{0i})=\sigma_{intercept}\)
      • \(SD(\epsilon_{ij})=\sigma_{error}\)

Random intercepts model, 3

  • \(SD(Y_{ij})=\sqrt{\sigma^2_{intercept}\ +\ \sigma^2_{error}}\)
  • \(Corr(Y_{ij}, Y_{im})=\frac{\sigma^2_{intercept}}{\sigma^2_{intercept}\ +\ \sigma^2_{error}}\)
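A sketch of fitting this model with nlme (a recommended package that ships with R), using simulated data with known parameters (\(\beta_0=50\), \(\beta_1=2\), \(\sigma_{intercept}=3\), \(\sigma_{error}=1\); the data are invented for illustration):

```r
library(nlme)

# Simulated longitudinal data: 20 subjects, 4 time points each
set.seed(1)
n_subj <- 20; times <- 0:3
dat <- data.frame(
  id   = factor(rep(1:n_subj, each = length(times))),
  time = rep(times, n_subj)
)
u0 <- rnorm(n_subj, sd = 3)                       # random intercepts
dat$y <- 50 + rep(u0, each = length(times)) +     # beta0 + u0i
         2 * dat$time +                           # beta1 * t_j
         rnorm(nrow(dat), sd = 1)                 # epsilon_ij

# Random intercepts model: fixed slope for time, random intercept per subject
fit <- lme(y ~ time, random = ~ 1 | id, data = dat)
fixef(fit)       # estimates of beta0 and beta1
VarCorr(fit)     # sigma_intercept and sigma_error estimates
```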

Random intercepts illustrated, 1

Random intercepts illustrated, 2

Illustration of random intercepts with real data

What did we learn in module 13?

A simple example of Bayesian data analysis.

  • ECMO study
  • Treatment versus control, mortality endpoint
    • Treatment: 28 of 29 babies survived
    • Control: 6 of 10 babies survived
    • Source: Jim Albert in the Journal of Statistics Education (1995, vol. 3 no. 3).

Wikipedia introduction

  • P(H|E) = P(E|H) P(H) / P(E)
    • H = hypothesis
    • E = evidence
    • P(H) = prior
    • P(E|H) = likelihood
    • P(H|E) = posterior

Prior distribution

  • Degree of belief
    • Based on previous studies
    • Subjective opinion (!?!)
  • Examples of subjective opinions
    • Simpler is better
    • Be cautious about subgroup analysis
    • Biological mechanism adds evidence
  • Flat or non-informative prior
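A sketch of this calculation for the ECMO example under a flat Beta(1, 1) prior, where the posterior for each survival probability is a beta distribution (a simulation-based comparison for illustration, not Albert's exact analysis):

```r
set.seed(1)
# ECMO data: treatment 28 of 29 survived, control 6 of 10 survived
# Flat Beta(1, 1) prior, so the posterior is Beta(survived + 1, died + 1)
p_trt <- rbeta(100000, 28 + 1, 1 + 1)   # posterior draws, treatment survival
p_ctl <- rbeta(100000, 6 + 1, 4 + 1)    # posterior draws, control survival

# Posterior probability that treatment improves survival
mean(p_trt > p_ctl)
```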